
Clipping Outliers: NYC hotel pricing dataset analysis¶

When analyzing a dataset, some data points can exert undue influence on models such as linear regression. These data points are called outliers. Outliers can distort summary statistics of the data and degrade model performance.

What are outliers?

In data science, outliers are values within a dataset that differ greatly from the others: they are either much larger or much smaller. Outliers can appear in a dataset due to measurement variability, data errors, experimental error, and so on. Because outliers in the training data can cause machine learning models to make inaccurate predictions, they should be handled before training a model.

One of the best ways to understand outliers is with box plots.

Box plots are useful for seeing the distribution of a variable/feature and detecting outliers in it. A box plot is a graphical representation that describes the behavior of the data in the middle as well as at both ends of the distribution. It shows the data based on the five-number summary:

  • Minimum: the lowest data point in a variable excluding any outliers
  • Median (Q2 or 50th percentile): the middle value in the variable
  • First quartile (Q1 or 25th percentile): also known as the lower quartile (0.25)
  • Third quartile (Q3 or 75th percentile): also known as the upper quartile (0.75)
  • Maximum: the highest data point in the variable excluding any outliers

Interquartile Range:

The difference between the upper quartile and the lower quartile (Q3 - Q1) is called the interquartile range, or IQR.

Box plots help us find the outliers in the data by using the IQR. As a rule, values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR are regarded as outliers. The image below will help us better understand the outliers in our data.

boxplot icon

In the image above, the points that are outside the whisker lines are the outliers.
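The 1.5*IQR rule can be sketched on a small made-up sample. The sample values and variable names here are illustrative assumptions, not part of the dataset. The sketch uses `statistics.quantiles` from the Python standard library with `method="inclusive"`, which interpolates linearly between data points and should correspond to the default behavior of the pandas `quantile()` method used later.

```python
from statistics import quantiles

# Hypothetical sample: mostly moderate values plus one extreme value
sample = [10, 12, 14, 15, 18, 20, 22, 25, 30, 200]

# n=4 splits the data into quartiles and returns [Q1, Q2, Q3]
q1, q2, q3 = quantiles(sample, n=4, method="inclusive")
iqr = q3 - q1

# Fences of the 1.5*IQR rule
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Any value outside the fences is flagged as an outlier
outliers = [x for x in sample if x < lower_fence or x > upper_fence]
print(q1, q3, iqr, lower_fence, upper_fence, outliers)
```

Only the value 200 falls outside the fences in this sample, so it is the only point a box plot would draw beyond the whiskers.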

There are different techniques to handle outliers in a dataset. In our example, we will use the concept of clipping (winsorizing).

What is winsorizing/clipping?

Clipping a dataset means capping its values at the last permitted extreme value, e.g. the 5th or 95th percentile value. For example, when we clip the data at the 95th percentile, every value above the 95th percentile is set to the 95th percentile value.

The following data set has several extremes (0.1, 1, 99, and 125):

  • {0.1, 1, 12, 14, 16, 18, 19, 21, 24, 26, 29, 32, 33, 35, 39, 40, 41, 44, 99, 125}

After clipping/winsorizing the top and bottom 10% of the data (replacing those values with the nearest remaining value), we get:

  • {12, 12, 12, 14, 16, 18, 19, 21, 24, 26, 29, 32, 33, 35, 39, 40, 41, 44, 44, 44}
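The winsorizing example above can be reproduced with a few lines of plain Python. This is a minimal sketch: the helper name `winsorize` and its `fraction` parameter are hypothetical, and the function replaces a fixed share of the smallest and largest values with the nearest remaining value, rather than interpolating a percentile.

```python
def winsorize(values, fraction):
    """Replace the lowest and highest `fraction` of values
    with the nearest value that remains after trimming."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)      # number of values to replace per tail
    low = ordered[k]                      # smallest value that is kept
    high = ordered[len(ordered) - k - 1]  # largest value that is kept
    return [min(max(v, low), high) for v in values]

data = [0.1, 1, 12, 14, 16, 18, 19, 21, 24, 26,
        29, 32, 33, 35, 39, 40, 41, 44, 99, 125]

result = winsorize(data, 0.10)
print(result)
```

With 20 values and a 10% share, two values are replaced on each side: 0.1 and 1 become 12, and 99 and 125 become 44.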

Let us work through a problem that handles outliers in data using clipping.

Problem Description:¶

To illustrate the clipping method, let's look at an example.

We have a dataset named nyc_airbnb.csv, which contains data about the per-night prices of Airbnb rentals. The price column contains some outliers. Our task is to find the outliers and handle them by winsorizing/clipping.

First, we load our dataset nyc_airbnb.csv into a dataframe and view it.

Load the Dataset and View data:¶

Step 1: import the pandas library as pd

In [1]:
import pandas as pd

Step 2: Load the data into a variable nyc using the read_csv method in pandas

In [2]:
nyc = pd.read_csv("../datasets/nyc_airbnb.csv")

Step 3: View the variable nyc.

In [3]:
nyc
Out[3]:
id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365
0 2539 Clean & quiet apt home by the park 2787 John Brooklyn Kensington 40.64749 -73.97237 Private room 149 1 9 2018-10-19 0.21 6 365
1 2595 Skylit Midtown Castle 2845 Jennifer Manhattan Midtown 40.75362 -73.98377 Entire home/apt 225 1 45 2019-05-21 0.38 2 355
2 3647 THE VILLAGE OF HARLEM....NEW YORK ! 4632 Elisabeth Manhattan Harlem 40.80902 -73.94190 Private room 150 3 0 NaN NaN 1 365
3 3831 Cozy Entire Floor of Brownstone 4869 LisaRoxanne Brooklyn Clinton Hill 40.68514 -73.95976 Entire home/apt 89 1 270 2019-07-05 4.64 1 194
4 5022 Entire Apt: Spacious Studio/Loft by central park 7192 Laura Manhattan East Harlem 40.79851 -73.94399 Entire home/apt 80 10 9 2018-11-19 0.10 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
48890 36484665 Charming one bedroom - newly renovated rowhouse 8232441 Sabrina Brooklyn Bedford-Stuyvesant 40.67853 -73.94995 Private room 70 2 0 NaN NaN 2 9
48891 36485057 Affordable room in Bushwick/East Williamsburg 6570630 Marisol Brooklyn Bushwick 40.70184 -73.93317 Private room 40 4 0 NaN NaN 2 36
48892 36485431 Sunny Studio at Historical Neighborhood 23492952 Ilgar & Aysel Manhattan Harlem 40.81475 -73.94867 Entire home/apt 115 10 0 NaN NaN 1 27
48893 36485609 43rd St. Time Square-cozy single bed 30985759 Taz Manhattan Hell's Kitchen 40.75751 -73.99112 Shared room 55 1 0 NaN NaN 6 2
48894 36487245 Trendy duplex in the very heart of Hell's Kitchen 68119814 Christophe Manhattan Hell's Kitchen 40.76404 -73.98933 Private room 90 7 0 NaN NaN 1 23

48895 rows × 16 columns

Check for outliers in price data:¶

Plot a strip plot for outlier estimation:¶

Step 1: Import the plotly.express library as px

In [4]:
import plotly.express as px

Step 2: Using px, call the strip() method to generate the strip plot

  • Inside the method, the parameters will be,
    • nyc: variable where the data is stored
    • price: column data to plot in the y axis
  • Store the result into a variable price_strip that will save the plot in this variable
In [5]:
price_strip = px.strip(nyc, y='price')

Step 3: Display the variable price_strip using the show() method

In [6]:
price_strip.show()

Use boxplot to estimate outliers:¶

Step 1: Call the boxplot() method that will generate the boxplot

boxplot()

Step 2: Inside the method, the parameters will be,

  • column: the column data to plot for the boxplot
  • figsize (optional): the size of the figure in terms of width and height
  • fontsize (optional): the size of the text in the figure
  • vert (optional): the alignment (x or y axis) of the plot. False means horizontal (x axis) alignment, True means vertical alignment
boxplot(column='price',figsize=(10,5), fontsize='8', vert=False)

Step 3: Apply the boxplot() method to the variable nyc, where our data is stored

nyc.boxplot(column='price',figsize=(10,5), fontsize='8', vert=False)

Step 4: Store the result in a variable box_price

In [7]:
box_price = nyc.boxplot(column='price', figsize=(15,5), fontsize='10', vert=False)

Observe price distribution in terms of numbers:¶

Step 1: Select the price column from the variable nyc

nyc['price']

Step 2: Use the describe() method on the price data. This shows summary statistics for the price data, including the five-number summary (min, 25%, 50%, 75%, max)

In [8]:
nyc['price'].describe()
Out[8]:
count    48895.000000
mean       152.720687
std        240.154170
min          0.000000
25%         69.000000
50%        106.000000
75%        175.000000
max      10000.000000
Name: price, dtype: float64
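In the summary above, the mean (152.72) is well above the median (106), and the maximum (10000) is far beyond the 75% value (175). That pattern is typical when a few very large values pull the mean upward. A minimal sketch with made-up prices (the lists here are hypothetical, not taken from the dataset) shows how a single extreme value moves the mean much more than the median:

```python
from statistics import mean, median

# Hypothetical nightly prices; the second list adds one extreme listing
typical = [60, 80, 100, 120, 140]
with_extreme = typical + [10000]

# Mean and median agree when there is no extreme value
print(mean(typical), median(typical))

# One extreme value shifts the mean far more than the median
print(mean(with_extreme), median(with_extreme))
```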

Find the Outliers:¶

Calculate Q3:¶

Step 1: Select the price column from the variable nyc

nyc['price']

Step 2: Use the quantile() method on the price data

nyc['price'].quantile()

Step 3: Inside the quantile method, the parameter will be,

  • the quantile to compute, given as a fraction between 0 and 1 (0.75 for q3)
nyc['price'].quantile(.75)

Step 4: Store the result in a variable q3

In [9]:
q3 = nyc['price'].quantile(0.75)

Step 5: Print the variable q3

In [10]:
print("q3:",q3)
q3: 175.0

Calculate Q1:¶

Step 1: Select the price column from the variable nyc

nyc['price']

Step 2: Use the quantile() method on the price data

nyc['price'].quantile()

Step 3: Inside the quantile method, the parameter will be,

  • the quantile to compute, given as a fraction between 0 and 1 (0.25 for q1)
nyc['price'].quantile(.25)

Step 4: Store the result in a variable q1

In [11]:
q1 = nyc['price'].quantile(0.25)

Step 5: Print the variable q1

In [12]:
print("q1:",q1)
q1: 69.0

Find the interquartile range (IQR):¶

Step 1: Subtract q1 from q3

q3 - q1

Step 2: Store the result in a variable iqr

In [13]:
iqr = q3 - q1

Step 3: Print the variable iqr

In [14]:
print("iqr:",iqr)
iqr: 106.0

Calculate the upper and lower bound for outliers:¶

For upper bound,

Step 1: Compute the expression q3 + 1.5*iqr

q3 + 1.5*iqr

Step 2: Store it in a variable upper_bound

In [15]:
upper_bound = q3 + 1.5*iqr

Step 3: Print the upper_bound

In [16]:
print("upper bound",upper_bound)
upper bound 334.0

For lower bound,

Step 1: Compute the expression q1 - 1.5*iqr

q1 - 1.5*iqr

Step 2: Store it in a variable lower_bound

In [17]:
lower_bound = q1 - 1.5*iqr

Step 3: Print the lower_bound

In [18]:
print("lower bound",lower_bound)
lower bound -90.0
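As a quick check, the bounds can be reproduced from the two quartiles printed earlier (q1 = 69.0 and q3 = 175.0). This sketch only repeats the arithmetic of the steps above with those values typed in by hand:

```python
# Quartiles as printed earlier in this notebook
q1, q3 = 69.0, 175.0

iqr = q3 - q1                   # 175 - 69
upper_bound = q3 + 1.5 * iqr    # 175 + 159
lower_bound = q1 - 1.5 * iqr    # 69 - 159

print(iqr, upper_bound, lower_bound)
```

The lower bound is negative, while the smallest price reported by describe() is 0, so no price can fall below it. In this dataset, only the upper tail contains outliers under the 1.5*IQR rule.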

Clip the Outliers:¶

Find the clipping points:¶

For lower_point,

Step 1: Use the max() function

max()

Step 2: In max(), the function parameters will be,

  • lower_bound: calculated in the previous step
  • nyc['price'].min() : the minimum value in the nyc['price'] data
max(lower_bound, nyc['price'].min())

Step 3: Store the result in the variable lower_point

In [19]:
lower_point = max(lower_bound, nyc['price'].min())

Step 4: Print the variable lower_point

In [20]:
print("lower_point", lower_point)
lower_point 0

For upper_point,

Step 1: Use the min() function

min()

Step 2: In min(), the function parameters will be,

  • upper_bound: calculated in the previous step
  • nyc['price'].max() : the maximum value in the nyc['price'] data
min(upper_bound, nyc['price'].max())

Step 3: Store the result in the variable upper_point

In [21]:
upper_point = min(upper_bound, nyc['price'].max())

Step 4: Print the variable upper_point

In [22]:
print("upper_point", upper_point)
upper_point 334.0
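The steps from the quartiles to the clipping points can be collected into one small helper. This is a sketch, not part of the original notebook: the function name `clipping_points`, its parameter `k`, and the `toy` series are hypothetical, and the toy values were chosen so that the quartiles match the ones computed above.

```python
import pandas as pd

def clipping_points(values: pd.Series, k: float = 1.5):
    """Return (lower_point, upper_point) from the k*IQR rule,
    limited to the range of values actually observed."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower_point = max(q1 - k * iqr, values.min())
    upper_point = min(q3 + k * iqr, values.max())
    return lower_point, upper_point

# Hypothetical prices with the same quartiles as the price column (69 and 175)
toy = pd.Series([0, 50, 69, 90, 106, 150, 175, 300, 10000])

low, high = clipping_points(toy)

# Count how many values lie above the upper clipping point
n_above = int((toy > high).sum())
print(low, high, n_above)
```

Taking the max() with the observed minimum and the min() with the observed maximum keeps both clipping points inside the range of the data, which is why the lower point ends up at 0 instead of -90.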

Clip outliers using the clipping points:¶

Step 1: Select the price column from nyc dataframe

nyc['price']

Step 2: Call the clip() method on the price column

nyc['price'].clip()

Step 3: Inside the clip method, set the parameters as,

  • lower_point: the lower point of price data
  • upper_point: the upper point of price data
nyc['price'].clip(lower_point, upper_point)

Step 4: Set the result to the price column of nyc dataframe to make the changes permanent

In [23]:
nyc['price'] = nyc['price'].clip(lower_point, upper_point)
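Clipping differs from filtering: clip() keeps every row and only caps the values, while a filter drops rows. A minimal sketch on a made-up series (the `toy` values are hypothetical; 0 and 334 are the clipping points found above) makes the difference visible:

```python
import pandas as pd

# Hypothetical prices; two values exceed the upper clipping point
toy = pd.Series([0, 80, 120, 334, 500, 10000])

# clip() caps values outside [0, 334] and keeps all rows
clipped = toy.clip(lower=0, upper=334)

# Filtering keeps only the rows already inside [0, 334]
filtered = toy[toy.between(0, 334)]

print(clipped.tolist())
print(len(clipped), len(filtered))
```

Because no rows are dropped, the count reported by describe() stays the same after clipping.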

Check clipped data distribution:¶

Step 1: Select the price column from the variable nyc

nyc['price']

Step 2: Use the describe() method on the price data. This shows summary statistics for the clipped price data, including the five-number summary (min, 25%, 50%, 75%, max)

In [24]:
nyc['price'].describe()
Out[24]:
count    48895.000000
mean       132.979753
std         83.530504
min          0.000000
25%         69.000000
50%        106.000000
75%        175.000000
max        334.000000
Name: price, dtype: float64
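In the output above, the 25%, 50%, and 75% rows are the same as before clipping, while max drops from 10000 to 334 and std from about 240.15 to about 83.53. The quartiles stay put because clipping only changes values beyond the clipping points, which lie outside the quartile range. A small sketch with made-up values (the list is hypothetical) illustrates this using the standard library:

```python
from statistics import quantiles

# Hypothetical prices with one extreme value
values = [0, 50, 69, 90, 106, 150, 175, 300, 10000]

# Cap every value to the interval [0, 334]
clipped = [min(max(v, 0), 334) for v in values]

# Quartiles before and after clipping
before = quantiles(values, n=4, method="inclusive")
after = quantiles(clipped, n=4, method="inclusive")

print(before == after, max(values), max(clipped))
```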

Visualize after clipping data:¶

Using strip plot:¶

Step 1: Using px, call the strip() method to generate the strip plot

  • Inside the method, the parameters will be,
    • nyc: variable where the data is stored
    • price: column data to plot in the y axis
  • Store the result into a variable price_strip2 that will save the plot in this variable
In [25]:
price_strip2 = px.strip(nyc, y='price')

Step 2: Display the variable price_strip2 using the show() method

In [26]:
price_strip2.show()

Using boxplot:¶

Step 1: Call the boxplot() method that will generate the boxplot

boxplot()

Step 2: Inside the method, the parameters will be,

  • column: the column data to plot for the boxplot
  • figsize (optional): the size of the figure in terms of width and height
  • fontsize (optional): the size of the text in the figure
  • vert (optional): the alignment (x or y axis) of the plot. False means horizontal (x axis) alignment, True means vertical alignment
boxplot(column='price',figsize=(10,5), fontsize='8', vert=False)

Step 3: Apply the boxplot() method to the variable nyc, where our data is stored

nyc.boxplot(column='price',figsize=(10,5), fontsize='8', vert=False)

Step 4: Store the result in a variable box_price2

In [28]:
box_price2 = nyc.boxplot(column='price', figsize=(10,5), fontsize='8', vert=False)

Conclusion¶

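By using the clip method, we have capped the outliers in the price data at the clipping points rather than removing them: the maximum price dropped from 10000 to 334, and the standard deviation from about 240 to about 84. With extreme values no longer dominating, this dataset should be better suited to building models that predict rental prices.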